Introduction

R is the world’s most popular (and best) statistical programming language. It is open source and hugely community-driven, constantly expanded by packages that add new functionality and implement new analysis methods and tools. New packages are added to the official curated package repository, CRAN, every week.


Thanks to this flexibility, usability, and speed, R’s popularity has exploded in the last decade, especially in bioinformatics. Just look at the number of questions being asked about these three popular languages on the online bioinformatics community BioStars.


Those fancy graphics and applets you see on news websites? Many of them are made in R, such as this collection from the BBC.




This session should set you up with the R skills to analyse biological data. In a few hours, you can challenge yourself to understand a line of code like this:

phenotype_data_1[ sample != "Wheat" , .( average_yield = mean(yield) , average_height = mean(height) ) , by=species ]

Make sure you do all the exercises, and avoid copying-and-pasting.

We will use R via an Integrated Development Environment (IDE) called RStudio. It’s a nice visual tool that puts everything you need on one screen and makes programming much easier.

Getting started with the IDE

First, access the IDE through a web client. Open a web browser (e.g. Chrome) and type “bit-11:8787” into the URL bar. Log in with your course username and password.

NOTE: For this to work from home, you will have to be connected to the VPN.

To begin your project, open the template file I made for you in the workshop_001 directory. Use the open file icon in the top left and navigate to the file:




The top left panel of the IDE is where you work on files. You’ll fill them with lines of R script that can be run sequentially. You can now write R commands into the document. Run all the commands described as you learn them here, and also the commands you’ll write for the exercises. As you work, save with ctrl-s regularly.

You should see this at the top of your script:

#run this to get all your data and functions loaded
source("/filer/projekte/ggr-course2020/r-source/genres_2020_bioinfo_setup_fns_data.R")

The first line is called a comment, indicated by the hash (#) symbol at its start. Anything after a hash counts as a comment. Comments are not code: R completely ignores them! They are just useful notes to the programmer or the reader. Feel free to use them: good code is full of helpful comments.

The second line is a command that will automatically load all the packages and data. To run a command, either highlight it or simply place the cursor on the same line, and press ctrl-<enter>. Do this now with the command above.

The lower left panel is the console, which is where R commands run. When you press ctrl-<enter>, the highlighted command is sent down to the console, and the output of the command is shown below it in the console. If the command creates a plot, the plot will be shown in the bottom right window.

The simplest commands are basic arithmetic. Type some simple expressions like these into your script and try running them with ctrl-<enter>.

1+1
1/2
5*5
4**2 #four to the power of two
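
In the console, each result should appear looking something like the comments below. The [1] at the start is just R labelling the position of the first element of the result (in R, even a single number is a vector of length one):

```r
1+1  #the console shows: [1] 2
4**2 #the console shows: [1] 16
```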

Variables

Variables are a key concept in programming. A variable is like a box that stores some information. When you want to later access or use that information, you can simply use the variable’s name. You can name a variable almost anything you like, but informative names are always helpful.

Let’s create a variable called my_var and store the number 3 in it. To do this, we use the assignment operator (<-), which looks like an arrow pointing the information into the box.

my_var <- 3

To check what my_var stores, you can just run it like a command.

my_var

Now, you can confirm that you can treat my_var as a substitute for the value it stores. Try a few operations, for instance:

my_var + 10
my_var * my_var

A single number is not a very interesting thing to store in a variable. Let’s define a “list” (technically called a vector) of numbers called my_vec. We do this by writing the list inside brackets preceded by c (short for “combine”), with the numbers separated by commas.

my_vec <- c(4,10,-100,2.4,1,-5,0)
my_vec #run it to make sure it contains the list

Vectorised operations

Having this list stored in my_vec will let us demonstrate a few more things. First, a true R specialty: vectorised operations.

Try running some simple arithmetic operations on my_vec.

my_vec * 4
my_vec / 100
my_vec + my_var
my_vec + my_vec

Notice how the operations act on all the entries in the vector? Think about the last example there, my_vec + my_vec. The result is a vector, made by adding the first element of my_vec to itself, and the same with the second, etc.

As with normal maths, operations can get quite tricky, and it’s best to include brackets to clarify what operations should go in what order. For instance, look at the results of these:

 my_vec + my_vec  * 2
(my_vec + my_vec) * 2

See? R defaults to doing the * first, but the brackets tell R to do the + first. So the results are different.

Functions

Secondly, functions. While simple operators like + and * perform straightforward things, functions are tools that perform more exciting and advanced things. Functions are always (ALWAYS!) called by a function name followed by brackets inside which is a list of arguments. For example:

seq( from = 1 , to = 10 , by = 2 )

This is a function for making vectors that are evenly-spaced sequences of numbers (here: 1, 3, 5, 7, 9). Notice:

  • seq is the function name, and we know it is a function because it is followed by brackets
  • There are three arguments for this function: from, to, and by. Arguments tell the function what to do or give the function information it needs. The format is always argument1 = <something> , argument2 = <something>. Don’t forget to separate them with commas!

Run that command to see what it does, then try adjusting the values for each argument to get an understanding of each one.

Once you’ve done that, try this:

seq(1,10,2)

Notice that it does exactly what the original command does. This is because you don’t strictly need to name the arguments. R will guess what the values are meant to mean based on their positions.

So, play around with these useful little functions and figure out what they do:

mean(my_vec)
sd(my_vec)
max(my_vec)
range(my_vec)
sum(my_vec)
rep(my_vec , times = 3 )
rep(my_vec , each = 3 )

Notice that not all functions return the same thing. Some return vectors, others single numbers. Some even return graphs and other pretty things, as we will see.
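
One handy way to see this for yourself is length(), a built-in function that counts the elements in a vector:

```r
length( mean(my_vec) )  #mean() returns a single number, so length 1
length( range(my_vec) ) #range() returns a vector of two numbers (the min and the max)
length( rep(my_vec , times = 3) ) #three copies of our 7-element vector, so length 21
```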

Logical operators

A VERY common theme in bioinformatics is asking yes/no questions about your data. For this reason, most programming languages include ways to formulate such questions so that a computer can understand them. They are constructed from a few simple, logical ideas that we use intuitively when we speak. In R, there are symbols for:

  • “Is greater than”: >
  • “Is less than”: <
  • “Is equal to”: ==
  • “AND”: &
  • “OR”: |
  • “NOT”: !

And a few clever combinations of these such as:

  • “Is greater than or equal to”: >=
  • “Is not equal to”: !=

Let’s try by making two new vectors and asking questions about them.

vec1 <- c(1,2,3,4,5,6,7)
vec2 <- c(7,6,5,4,3,2,1)

vec1 > 4

That should give you the result “FALSE FALSE FALSE FALSE TRUE TRUE TRUE”, meaning that the answer is FALSE for the first four elements of vec1 (1, 2, 3, and 4), and TRUE for the last three elements (5, 6, and 7).

If we want to add more conditions, we could use the AND operator (&).

vec1 > 4 & vec1 < 7

So, we get TRUE for all the elements which are greater than 4 AND less than 7 … so “FALSE FALSE FALSE FALSE TRUE TRUE FALSE”

We can also ask questions that compare the elements of vectors (if the vectors are the same length, of course).

vec1 == vec2

We get “FALSE FALSE FALSE TRUE FALSE FALSE FALSE”. Why? Well …

  • The first result was asking if the first element of vec1 (1) is equal to the first element of vec2 (7). It isn’t, so we get FALSE.
  • The second result was asking if the second element of vec1 (2) is equal to the second element of vec2 (6). It isn’t, so we get FALSE.
  • … etc …
  • The fourth result was asking if the fourth element of vec1 (4) is equal to the fourth element of vec2 (4). It is! So we get TRUE.
  • … etc …

Work through these, experiment, and make sure you understand the outcome. Commands like these will be very important for the rest of the course.

vec1 == 2
! vec1 == 2 #the NOT symbol (!) simply reverses the TRUE and FALSE outcomes of whatever comes after it.
vec1 != 2
! vec1 >= 5
vec1 == 1 & vec2 == 7
vec1 == mean(vec1)
sum(vec1) == sum(vec2)

Exercises:

  1. Use the seq() function to make a sequence that goes from 0 to 50 in jumps of 2 (0, 2, 4, … 50)
  2. Write the sequence into a variable called my_seq using the assignment operator (<-).
  3. Write a command that will add 100 to every element of my_seq
  4. Write a command to calculate the mean of my_seq and write it into a variable called seq_mean
  5. Write a command that will tell you which elements of my_seq are greater than seq_mean
  6. BONUS: Perform Z-score normalisation on my_seq! The Z-score of an element in a vector is equal to: the element’s value minus the mean of the vector, divided by the standard deviation of the vector.
    • HINT: Use the mean function mean() and the standard deviation function sd().
    • HINT: Remember to use brackets to clarify which operation (subtraction of the mean or division by the sd) comes first.

A programmer’s best friend!

Here are three tricks to make your life a lot easier.

Help documentation

To open a help window that explains the arguments of a function, run the function name with a question mark just before it. Try:

?seq

Building commands

The structure of a command can get complicated. Some commands have a lot of arguments or parts, and a single mistake will make them fail. To help, programmers usually do NOT write the command from left to right like a sentence. Instead, we write the structure of the command first, then fill in the blanks.

For instance, to write the seq() command above, I would go through a few steps:

#step 1:
seq() # this ensures we have closed the brackets

#step 2:
seq( from = , to = , by = ) #this ensures we have all the argument names and commas in place

#step 3:
seq( from = 1 , to = 10 , by = 2 ) #fill in the blanks to complete the command

<tab> autocomplete:

Finally, the IDE is clever! It knows there are a limited number of things you could be typing, and it will try to autocomplete or suggest possibilities when you press the <tab> key. The <tab> key on my keyboard is blank because I use it so much.

For instance, type se and press <tab>. The IDE will give you a whole list of commands and variables beginning with ‘se’, including our friend seq():

Notice how this even gives you some info on the command. Now try my_v and then <tab>. You should get suggestions for variables beginning with those letters:

Useful! If there is only one option available, the command will be automatically completed. The <tab> trick also works on file names.

Getting creative

To get a taste of R’s flexibility, we can create some data, plot it, and even fit a model to it (we will fit a polynomial curve but R can fit almost any model you can think of). By running and experimenting with the following commands, answer the questions below.

x <- seq( from = 1 , to = 20 , by = 1 )
noise <-  rnorm( n = 20, mean = 0, sd = 1 )
y <- (1.1 ** x) + noise
plot( x , y )
fit_polynomial( x , y )

Exercises:

  1. What does the variable x store?
  2. What does the function rnorm() do?
    • What do each of the arguments do?
  3. What would the following commands produce?
    • noise
    • 1.1 ** x
    • (1.1 ** x) + noise
    • 1.1 ** (x + noise)
  4. The vectors noise and x are the same length (20). Why is this important?
    • HINT: Think about the command (1.1 ** x) + noise
  5. Modify the commands to make this example extend all the way from 1 to 100.

Well done! Now is a good time to grab a coffee and let your mind rest a few minutes.

The Amazing data.table

A data.table is one of the most useful data structures in statistical programming. It is very much like a typical Microsoft Excel spreadsheet, where rows might represent experimental ‘units’ (observations, SNPs, plants, subjects, measurements … ).

Let’s dive in with some pretend data that looks almost identical to real genetics data. I’ve created a pre-loaded data.table stored in the variable phenotype_data_1. Have a look.

phenotype_data_1

It should look like this:



Basically, each row tells us phenotype information about a particular sample in a genetic diversity panel. The information is in columns: the sample name, the species, the yield, the height, and whether the sample can self-pollinate (selfing).

Perhaps you noticed that the columns look a bit like vectors, just written vertically instead of horizontally. That’s not a coincidence: they really are vectors, and the glory of a data.table is that we can very easily manipulate and analyse the data in flexible and creative ways by referring to these vectors. That’s what we will learn now.
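
To convince yourself that the columns really are vectors, you can pull one out with the $ symbol and treat it exactly like the vectors from earlier (the column names here are the same ones used in the commands below):

```r
phenotype_data_1$height         #the height column, as a plain vector
mean( phenotype_data_1$height ) #so all our vector functions work on it ...
phenotype_data_1$height / 100   #... and so do vectorised operations
```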

The three indexing fields

data.table’s power lies largely in its excellent indexing system. Every data.table has three indexing fields, each with its own job. Spend a minute memorising this basic structure:

data_table_name[ SELECT , OUTPUT , GROUP_BY ]

When you start writing a data.table command, just start by writing data_table_name[ , , ] (where data_table_name is just the name of the variable your data.table is stored in). Then fill in the fields. You can leave fields blank too.
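
For example, leaving all three fields blank is perfectly legal: it just returns the whole table, which makes it a good first step for building any data.table command.

```r
phenotype_data_1[ , , ] #all fields blank: returns the whole table, unchanged
```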

1) The SELECT field.

This is used to select certain rows, and we do that by making logical statements about the columns, for instance try this:

phenotype_data_1[ height > 150 , , ]

This selects all the rows where height is above 150.

Here’s another example.

phenotype_data_1[ sample=="sample_5" , , ]

This selects the row where the sample is “sample_5”.

And you can use logical tests on all the columns to do your selection. For instance, maybe we’d like to see all the samples that can self-pollinate, but ignore the wheat samples. For that, we combine two logical statements with the AND operator (&), and use the NOT operator (!) to cut out the wheat samples:

phenotype_data_1[ selfing==TRUE & ! species=="Wheat" ,  ,  ]

A good way to visualise selection is to imagine crossing out the rows that fail the logical tests, like this.

Exercises:

  1. Select all the rows of phenotype_data_1 where the yield is above 10.
  2. Select all the rows of phenotype_data_1 where the yield is above 10, or the height is above 100.
  3. Select all the rows where the yield is between 8 and 10.
    • HINT: Use the AND operator & in conjunction with less than (<) and a greater than (>) operators.
  4. BONUS: Select every barley row where the yield is above the mean yield for the whole table.
  5. BONUS: Select every barley row where the yield is above 80% of the mean yield for the whole table.

2) The OUTPUT field.

By default, a data.table command will just return the rows of the data.table you selected (or all the rows if you didn’t use the SELECT field).

Instead, you might want to actually do something with the columns of the data.table, for example finding the mean or maximum value of a column. Our calculations will become the columns of the result, and we get to name these output columns. Try this command:

phenotype_data_1[ , .( average_yield = mean(yield) , tallest_plant_height = max(height) ) , ]

As you can see, the format of the command is:

data_table_name[ , .( column_name_1 = , column_name_2 = ) , ]

So, the whole OUTPUT field is enclosed in brackets with a dot in front of them: .( ), and inside the brackets is a comma-separated list of output column names (you choose these!), and each has its own command telling R how to generate the output you want in this column.

Remember the “building commands” section above? That is useful here. I would progress through typing my command like this.

#set up the basic structure
phenotype_data_1[ , , ]

#since I want to use the output column, set up the brackets
phenotype_data_1[ , .() , ]

#put in the names of the output columns I want, separated by commas
phenotype_data_1[ , .( average_yield = , tallest_plant_height = ) , ]

#fill in the commands for each column
phenotype_data_1[ , .( average_yield = mean(yield) , tallest_plant_height = max(height) ) , ]

Your turn …

Exercises

  1. Build a command as shown above (no copy-pasting!), but this time produce three columns:
    1. A column called “minimum_height” that contains the minimum height.
    2. A column called “total_yield” that contains the sum of the yields.
    3. A column called “SD_height” that contains the standard deviation of all the heights.
      • HINT: You will use the min(), sum() and sd() functions.
  2. Write a command that produces just one column called “yield_per_height”, which contains the yield divided by the height for each sample. You will notice the result has lots of rows, where the last result had only one row. Why?
  3. BONUS: Try the command from question 2 WITHOUT the .( ) surrounding the command. What happens? Can you work out now why the .( ) is necessary when producing more than one column? Ask Tim if your guess is correct.

As you have probably guessed, you can combine the use of the SELECT and OUTPUT fields. Let’s build a command to calculate the average yield of the wheat samples (if you feel confident, try it on your own first!)

#set up the basic structure
phenotype_data_1[ , , ]

#Fill in the SELECT field to choose only wheat (I would then run the command to test that the selection works)
phenotype_data_1[ species=="Wheat" ,  ,  ]

#since I want to use the output column, set up the brackets
phenotype_data_1[ species=="Wheat" , .() , ]

#set up my output column name
phenotype_data_1[ species=="Wheat" , .( average_wheat_yield = ) , ]

#fill in the command
phenotype_data_1[ species=="Wheat" , .( average_wheat_yield = mean(yield) ) , ]

Exercise

  • Build a command to output the maximum height of all the rye samples.

3) The GROUP_BY field.

Imagine we want to compare the yields of different species. We would need to calculate the average yield of not just of one species, but of every species. The third field, GROUP_BY lets us divide the data.table up into groups and do OUTPUT operations in each group separately. To use it, put by=<grouping column name> in the GROUP_BY field.

So to calculate the mean yields of each species, our grouping column will be “species”. Build up your command like this:

#structure
phenotype_data_1[ , , ]

#THINK: "Do I want to select specific rows?" Since we want all the rows, leave the SELECT field empty

#Since we want to output something, set up the OUTPUT field
phenotype_data_1[ , .() , ]

#name columns
phenotype_data_1[ , .( average_yield = ) , ]

#add command
phenotype_data_1[ , .( average_yield = mean(yield) ) , ]

#THINK: "Do I want to calculate the output separately in each group?" Yes, so add the grouping column in the GROUP_BY field
phenotype_data_1[ , .( average_yield = mean(yield) ) , by=species ]

And we get this:

If you want to visualise this, imagine sorting the table into groups, then doing your calculations separately on the groups:

And, of course, we can incorporate the SELECT field too. For instance, we could run the same command but exclude samples with height below 90:

phenotype_data_1[ ! height < 90 , .( average_yield = mean(yield) ) , by=species ]

Which could be visualised like this:

Exercise:

  • Build a command that will output the average height for the groups of samples which do and don’t self-pollinate.

Adding columns to data.tables

You now know enough about data.tables to do some really meaningful analysis, which we will use in the next workshop. For that workshop (and for the assignment), you will sometimes need to add a column to a data.table. Luckily, it’s quite simple.

You specify what you want in the new column in the OUTPUT field, just like before, except:

  • You can only add ONE column per command.
  • Don’t use the .( ) that usually surrounds commands in the OUTPUT field.
  • Instead of the normal equals symbol (=), use the special column-adding equals :=.

So, I could add a column called height_m to phenotype_data_1 that gives the height in metres. To get that, we just divide the heights in cm by 100, so the command would be:

phenotype_data_1[  , height_m := height / 100 ,  ]

Try it and then have a look to make sure it added the column. Then …

Exercise

  • Add a column called yield_dt to phenotype_data_1 that gives the yield in decitonnes per hectare (dt/ha). To get dt/ha from the yield column, multiply the values by 10.
  • BONUS: Adding columns always returns a data.table with the same number of rows as the original table. But if you use SELECT or GROUP_BY, the results can often have fewer rows than the original data.table. Write some commands that combine adding a column with the SELECT and/or GROUP_BY features that can help you work out how R deals with this situation.

Great work! The last section for today is about 15–30 minutes of work. Now is a good time for another coffee break.

Plotting with ggplot

Last of all for this session, we’ll learn the very basics of making scatterplots with the very popular package ggplot2. This package can produce some very beautiful and complex plots with just one command, and it works wonderfully with data in data.table-type formats.

A typical ggplot2 command for a scatterplot looks like this:

ggplot( data=<data.table name> , mapping=aes( x= , y= ) ) + geom_point()

So, data=<data.table name> takes the data.table you want to plot data from, x= takes the column that will provide values for the x-axis, and y= takes the column that will provide values for the y-axis.

Once again: Start building commands by typing the structure, and then fill in the blanks.

Imagine we want to know if there are differences between yields for the different species. Try this command where we put the species on the x-axis and yields on the y-axis.

#start like this, with the structure
ggplot( data=  , mapping=aes( x=  , y= ) ) + geom_point()

#and fill in the blanks
ggplot( data=phenotype_data_1 , mapping = aes( x=species , y=yield ) ) + geom_point()

So that’s the basic idea. It gets interesting when we start using more columns to add more visual information. For instance, we might wonder whether selfing samples have on average a smaller or larger yield. We could add that information to the graph, too—perhaps as colour or shape? As it happens, the “aes” in the aes( ... ) part of the command stands for “aesthetic”, because it maps (see why the argument name is “mapping”?) certain columns to certain aesthetic features. So we can, for instance, map selfing to colour like this:

ggplot( data=phenotype_data_1 , mapping = aes( x=species , y=yield , colour=selfing ) ) + geom_point()

#or map selfing to shape!
ggplot( data=phenotype_data_1 , mapping = aes( x=species , y=yield , shape=selfing ) ) + geom_point()

And it’s possible to go crazy with aesthetics, for instance, let’s use as many as we can to represent the entire dataset on one plot:

ggplot( data=phenotype_data_1 , mapping = aes( x=species , y=yield , shape=selfing , colour=sample , size=height ) ) + geom_point()

Of course, this is a bit stupid. It’s not a very intuitive or informative plot … choosing aesthetics is an art.

Exercises

  1. Create a scatterplot that looks for a relationship between yield and height by placing one of them on the x-axis and the other one on the y-axis
  2. Add a colour aesthetic to show which data are from which species
  3. Add another aesthetic of your choice to show which data are selfing

A showcase of ggplot’s flexibility

Just to round up, let’s have a look at a few of ggplot’s built-in abilities. Since it’s the end of the day, just copy-paste these commands to run them, and we’ll discuss the actual results in biological terms.

Here’s the original plot we made of species vs. yield.

ggplot( data=phenotype_data_1 , mapping = aes( x=species , y=yield ) ) + geom_point()

That + geom_point() means “add points to the graph”. We can add different–or multiple–“geoms” to the plots. For instance, we can make a classic boxplot by adding + geom_boxplot()

ggplot( data=phenotype_data_1 , mapping = aes( x=species , y=yield ) ) + geom_boxplot()

Like many statisticians, I personally hate boxplots: they hide more than they reveal. I would prefer something that shows a more honest representation of the spread of data in each group. So naturally, I love geom_violin(). Check it out.

ggplot( data=phenotype_data_1 , mapping = aes( x=species , y=yield ) ) + geom_violin()

And even better, we can show both the raw data, and the violin plots, by using both geom_point() and geom_violin() (with + geom_point() second, so the points are drawn on top of the violins).

ggplot( data=phenotype_data_1 , mapping = aes( x=species , y=yield ) ) + geom_violin() + geom_point()
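
The order of the geoms matters: ggplot draws them from left to right, so whichever geom comes last is drawn on top. Try swapping them and compare:

```r
#points first, then violins: the violins are now drawn on top and partly cover the points
ggplot( data=phenotype_data_1 , mapping = aes( x=species , y=yield ) ) + geom_point() + geom_violin()
```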

ggplot2 will even fit statistical models to your data. Here’s a plot of yield vs. height once more.

ggplot( data=phenotype_data_1 , mapping = aes( x=height , y=yield , colour=species ) ) + geom_point()

You can see some obvious trends in it, for instance, shorter species have higher yields, but within each species taller plants have higher yields. With geom_smooth() we can ask ggplot to fit a model to the data. In this command, we will specify a simple linear model using the argument method="lm".

ggplot( data=phenotype_data_1 , mapping = aes( x=height , y=yield , colour=species ) ) + geom_point() + geom_smooth(method="lm")

To finish up, let’s change mental gears and discuss this data in a biological context. The data are just made up, but are they realistic? Well … sort of. Relationships between agronomic factors like grain yield, plant biomass, height, genetic makeup, and so on are notoriously complex and interdependent. The relationships are often non-linear, and they can drastically change based on soil type, climatic and weather conditions, nutrient availability and so on. However, the trends we see here do reflect relationships that sometimes occur in real life.

For instance, note these three things:

  1. The wheat have the highest overall yields, followed by barley, and then rye. This is the case, for instance, in Germany (data from the excellent site FAOSTAT).

    This is also the case in most countries that grow these crops such as the USA, Poland, and the Russian Federation. Of course, these countries have varied environments and growers will likely plant whatever crops yield best in their area (for instance, wheat tolerates dry soils and rye is excellent in the cold). And economic factors of supply and demand affect their choices, along with crop rotation and sowing/harvesting schedules … and many other factors.

  2. In our data, the wheat have low variability in yield and height, while the rye is notably variable. Naturally this would depend on our samples. If you compare samples from a single isogenic line with a collection of wild samples, the wild samples will always be more variable. But this variability does somewhat reflect the different breeding systems that are prevalent among these species. Wheat (well, bread wheat) is a domesticated species that normally self-pollinates, limiting its genetic diversity. Rye is semi-wild and often self-incompatible, so it has more genetic diversity, and hence more variation in growth form.

  3. What about the positive relationship between height and yield within each of the species? That is also semi-realistic, but reality is usually more complex. Consider this chart from Addisu et al. (2009):

    Each dot represents a crop of wheat plants. They are genetically almost identical, except for variation at a gene that affects the height. Each panel shows the results for one year. The black dots show conventionally-grown plants, and the open dots are organically grown.

    Obviously the organic methods don’t yield as much as the conventional methods, and there is some height/yield relationship. But this relationship changes every year. In some years ((c) and (d)) it looks like maybe the best height is about 80 cm. Taller than this, and the plants tend to lodge (fall over) and the yield decreases. This is in fact part of the reason why so-called dwarfing genes were such a key part of the green revolution of the 1950s and 1960s: they help plants to attain this “sweet spot” of size for maximum yield. In panel (e) it looks like the sweet spot is around 100 cm … but there are no data at 80 cm so it’s hard to say.



> The lesson is this: the more data we have to analyse the better. And big data requires analysis using the very tools we’ve been using today. You can do a lot of good for the world with just a little coding skill and good data literacy.





Rye and barley growing together here at IPK. Based on what you now know–can you tell which is which?